Text Categorization for Improved Priors of Word Meaning
نویسندگان
چکیده
Distributions of the senses of words are often highly skewed. This fact is exploited by word sense disambiguation (WSD) systems which back off to the predominant (most frequent) sense of a word when contextual clues are not strong enough. The topic domain of a document has a strong influence on the sense distribution of words. Unfortunately, it is not feasible to produce large manually sense-annotated corpora for every domain of interest. Previous experiments have shown that unsupervised estimation of the predominant sense of certain words using corpora whose domain has been determined by hand outperforms estimates based on domain-independent text for a subset of words and even outperforms the estimates based on counting occurrences in an annotated corpus. In this paper we address the question of whether we can automatically produce domain-specific corpora which could be used to acquire predominant senses appropriate for specific domains. We collect the corpora by automatically classifying documents from a very large corpus of newswire text. Using these corpora we estimate the predominant sense of words for each domain. We first compare with the results presented in [1]. Encouraged by the results we start exploring using text categorization for WSD by evaluating on a standard data set (documents from the SENSEVAL-2 and 3 English all-word tasks). We show that for these documents and using domain-specific predominant senses, we are able to improve on the results that we obtained with predominant senses estimated using general, non domain-specific text. We also show that the confidence of the text classifier is a good indication whether it is worthwhile using the domain-specific predominant sense or not.
منابع مشابه
Text Categorization
Text categorization is the task of assigning predefined categories to natural language text. With the widely used “bag-ofword” representation, previous researches usually assign a word with values that express whether this word appears in the document concerned or how frequently this word appears. Although these values are useful for text categorization, they have not fully expressed the abunda...
متن کاملConnecting the Dots: Mass, Energy, Word Meaning, and Particle-Wave Duality
With insight from linguistics that degrees of text cohesion are similar to forces in physics, and the frequent use of the energy concept in text categorization by machine learning, we consider the applicability of particle-wave duality to semantic content inherent in
متن کاملSemantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
In this paper, we propose a novel topic model based on incorporating dictionary definitions. Traditional topic models treat words as surface strings without assuming predefined knowledge about word meaning. They infer topics only by observing surface word co-occurrence. However, the co-occurred words may not be semantically related in a manner that is relevant for topic coherence. Exploiting di...
متن کاملIranian EFL Learners’ Lexical Inferencing Strategies at Both Text and Sentence levels
Lexical inferencing is one of the most important strategies in vocabulary learning and it plays an important role in dealing with unknown words in a text. In this regard, the aim of this study was to determine the lexical inferencing strategies used by Iranian EFL learners when they encounter unknown words at both text and sentence levels. To this end, forty lower intermediate students were div...
متن کاملUsing Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization
This paper investigates the use of conceptbased representations for text categorization. We introduce a new approach to create concept-based text representations, and apply it to a standard text categorization collection. The representations are used as input to a Support Vector Machine classifier, and the results show that there are certain categories for which concept-based representations co...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007